Simultaneous Clustering of Multiple Gene Expression and Physical Interaction Datasets
Abstract
Many genome-wide datasets are routinely generated to study different aspects of biological systems, but integrating them to obtain a coherent view of the underlying biology remains a challenge. We propose simultaneous clustering of multiple networks as a framework for integrating large-scale datasets on the interactions among, and activities of, cellular components. Specifically, we develop an algorithm, JointCluster, that finds sets of genes that cluster well in multiple networks of interest, such as coexpression networks summarizing correlations among the expression profiles of genes, and physical networks describing protein-protein and protein-DNA interactions among genes or gene products. Our algorithm provides an efficient solution to a well-defined problem of jointly clustering networks, using techniques that permit certain theoretical guarantees on the quality of the detected clustering relative to the optimal clustering. These guarantees, coupled with an effective scaling heuristic and the flexibility to handle multiple heterogeneous networks, make JointCluster an advance over earlier approaches. Simulation results show JointCluster to be more robust than alternative methods in recovering clusters implanted in networks with high false positive rates. In a systematic evaluation of JointCluster and some earlier approaches for combined analysis of the yeast physical network and two gene expression datasets under glucose and ethanol growth conditions, JointCluster discovers clusters that are more consistently enriched for various reference classes capturing different aspects of yeast biology, or that yield better coverage of the analysed genes. These robust clusters, which are supported across multiple genomic datasets and diverse reference classes, agree with the known biology of yeast under these growth conditions, elucidate the genetic control of coordinated transcription, and enable functional predictions for a number of uncharacterized genes.
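To make the idea of a gene set that "clusters well in multiple networks" concrete, here is a minimal sketch. It is not the JointCluster algorithm itself, which the abstract does not specify. It assumes, purely for illustration, that cluster quality in a single network is measured by conductance and that a joint score takes the worst (largest) conductance over all networks. The function names and the toy gene labels are hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple

# Undirected, unweighted network stored as adjacency sets (an assumed representation).
Graph = Dict[Hashable, Set[Hashable]]


def from_edges(edges: Iterable[Tuple[Hashable, Hashable]]) -> Graph:
    """Build an undirected adjacency-set graph from an edge list."""
    graph: Graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def conductance(graph: Graph, cluster: Iterable[Hashable]) -> float:
    """Cut edges divided by the smaller of the two side volumes (lower is better)."""
    inside = set(cluster) & set(graph)
    outside = set(graph) - inside
    cut = sum(1 for u in inside for v in graph[u] if v not in inside)
    vol_in = sum(len(graph[u]) for u in inside)
    vol_out = sum(len(graph[u]) for u in outside)
    denom = min(vol_in, vol_out)
    # Treat a degenerate cut (one side has no edges) as the worst possible score.
    return 1.0 if denom == 0 else cut / denom


def joint_score(graphs: List[Graph], cluster: Iterable[Hashable]) -> float:
    """Assumed joint measure: the worst conductance of the cluster over all networks."""
    members = list(cluster)
    return max(conductance(g, members) for g in graphs)


# Toy data: a coexpression network and a physical network over the same six genes.
coexpression = from_edges(
    [("a", "b"), ("a", "c"), ("b", "c"), ("d", "e"), ("d", "f"), ("e", "f"), ("c", "d")]
)
physical = from_edges(
    [("a", "b"), ("b", "c"), ("d", "e"), ("e", "f"), ("a", "d"), ("c", "f")]
)
networks = [coexpression, physical]

good = joint_score(networks, {"a", "b", "c"})  # tight in both networks
poor = joint_score(networks, {"a", "d"})       # badly cut in the coexpression network
```

Under these assumptions, a gene set scores well only if it is well separated in every network, so a set that is tight in the coexpression network but scattered in the physical network is penalized.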